Welcome to deep learning. In this short video we want to go ahead and look into some basic functions of neural networks, in particular the softmax function, and look into some ideas of how we could potentially train these deep networks.
We have a search technique, which is just local search, gradient descent, to try to find a program that is running on these recurrent networks such that it can solve some interesting problems, such as speech recognition or machine translation and things like that. Okay, so let's start.
Activation functions for classification. So far we have described the ground truth by the labels minus one and plus one, but of course we could also have the classes zero and one. This is really only a matter of definition as long as we only decide between two classes. But if you want to go into more complex cases, you want to be able to classify multiple classes. In this case you probably want to have an output vector with essentially one dimension per class k. So capital K here is the number of classes, and you can then define a ground truth representation as a vector that has all zeros except for one position, and that is the true class.
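As a small illustration (an example added here, not one from the slides): with K = 4 classes and class 3 being the true class, the ground truth vector would be

```latex
y = (0,\; 0,\; 1,\; 0)^{\mathrm{T}}
```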
This is also called one-hot encoding because all of the other entries of the vector are zero and only a single position contains a one. But I think this is very difficult for normal people to understand; they would not know what they are looking at. Now you try to compute a classifier that will produce such a vector, and with this vector y hat you can then go ahead and do the classification. So it's essentially like guessing an output probability for each of the classes. In particular for multi-class problems, this has been shown to be a more effective way
of approaching these problems. Now the problem is that you want to have a kind of probabilistic output between zero and one, but we typically have some arbitrary input vector x that could be arbitrarily scaled. So in order to produce our predictions, we employ a trick, and the trick is that we use the exponential function. This is very nice because the exponential function maps everything into a positive range, and now you want to make sure that the maximum that can be achieved is exactly one. So you do that for all of your classes: you compute the sum over the exponentials of all input elements, i.e. you apply the exponential function to each of them and sum them up. This gives you the maximum that can be attained by this conversion, and you divide by this number for each of your given inputs. This will always scale to the zero-one range, and it has the property that if you sum up all elements of the resulting vector, the result equals one.
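Written out as a formula, using the notation from above with x_k the k-th input element and K the number of classes, this is the softmax function:

```latex
\hat{y}_k = \frac{\exp(x_k)}{\sum_{j=1}^{K} \exp(x_j)}, \qquad k = 1, \dots, K .
```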
This is very nice because these are two of the axioms of a probability distribution as introduced by Kolmogorov. So this allows us to treat the output of the network as a kind of probability distribution. And that was my 1987 diploma thesis, which was all about that.
If you look in the literature, or also in software examples, the softmax function is sometimes also known as the normalized exponential function; it is the same thing. Now let's look at an example.
So let's say this is our input to our neural network; you see this small image on the left. Now you introduce labels for this three-class problem. Wait, there's something missing: it's a four-class problem. So you introduce labels for this four-class problem, and then you have some arbitrary input that is shown here in the column x_k. The values range from minus 3.44 to 3.91. This is not so great, so let's use the exponential function. Now everything is mapped to positive numbers, and there is quite a difference between the numbers. So we need to rescale them, and you can see that the highest probability is of course returned for heavy metal in this image.
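A minimal sketch of this computation in NumPy; note that only the values -3.44 and 3.91 and the class "heavy metal" are taken from the lecture example, while the remaining scores and class names are made-up placeholders:

```python
import numpy as np

def softmax(x):
    """Normalized exponential: maps arbitrary scores to a probability vector."""
    e = np.exp(x - np.max(x))  # subtracting the max does not change the result,
    return e / e.sum()         # but avoids overflow for large inputs

# Hypothetical four-class scores: only -3.44, 3.91, and the class
# "heavy metal" appear in the lecture example; the rest are placeholders.
classes = ["classical", "folk", "jazz", "heavy metal"]
scores = np.array([-3.44, 1.16, -0.81, 3.91])

probs = softmax(scores)
for c, p in zip(classes, probs):
    print(f"{c}: {p:.3f}")
# All outputs are positive, they sum to one, and the largest
# probability is assigned to "heavy metal".
```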
So let's go ahead and also talk a bit about loss functions. The loss function is a function that tells you how good the prediction of a network is. A very typical one is the so-called cross-entropy loss, which is the cross entropy computed between two probability distributions. So you have your ground truth distribution and the one that you are estimating, and then you can compute the cross entropy in order to determine how well they align with each other. You can then also turn this into a loss function. Here we can use the property that all elements of the ground truth vector are zero except for the true class, so we only have to determine the negative logarithm of y hat k, where k is the true class.
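As a sketch, continuing the NumPy example above and assuming a one-hot ground truth as described, the cross entropy collapses to the negative log-probability of the true class:

```python
import numpy as np

def cross_entropy(y_hat, true_class):
    """Cross-entropy loss L = -sum_k y_k * log(y_hat_k) for a one-hot ground truth.

    Because all y_k are zero except at the true class, the sum collapses
    to -log(y_hat_k*) where k* is the index of the true class.
    """
    return -np.log(y_hat[true_class])

# e.g. with the softmax output from the sketch above and "heavy metal" (index 3) as true class:
# loss = cross_entropy(probs, true_class=3)
```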
Deep Learning - Feedforward Networks Part 2
This video introduces the topics of activation functions, loss, and the idea of gradient descent.